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Abstract 


We derive the asymptotic distributions of the spiked eigenvalues and eigenvectors 
under a generalized and unified asymptotic regime, which takes into account the spike 
magnitude of leading eigenvalues, sample size, and dimensionality. This new regime 
allows high dimensionality and diverging eigenvalue spikes and provides new insights 
into the roles the leading eigenvalues, sample size, and dimensionality play in principal 
component analysis. The results are proven by a technical device, which swaps the role 
of rows and columns and converts the high-dimensional problems into low-dimensional 


ones. Our results are a natural extension of those in Paul (2007) to more general 


setting with new insights and solve the rates of convergence problems in Shen et al. 


(2013). They also reveal the biases of the estimation of leading eigenvalues and eigen¬ 


vectors by using principal component analysis, and lead to a new covariance estimator 
for the approximate factor model, called shrinkage principal orthogonal complement 
thresholding (S-POET), that corrects the biases. Our results are successfully applied 
to outstanding problems in estimation of risks of large portfolios and false discovery 
proportions for dependent test statistics and are illustrated by simulation studies. 
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1 Introduction 


Principal Component Analysis (PCA) has widely been used as a powerful tool for dimen¬ 
sionality reduction and data visualization. Its theoretical properties such as the consistency 
and asymptotic distributions of empirical eigenvalues and eigenvectors are challenging espe¬ 
cially in high dimensional regime. For the past half century substantial amount of efforts 
have been devoted to understanding empirical eigen-structures. An early effort is |Andersoii 


(1963) who established the asymptotic normality of eigenvalues and eigenvectors under the 
classical regime with large sample size n and hxed dimension p. However, as dimensionality 
diverges at the same rate as the sample size, sample covariance matrix is a notoriously bad 
estimator with substantial different eigen-structure from the population one. A lot of recent 
literatures make the endeavor to understand the behaviors of eigenvalues and eigenvectors 


under high dimensional regime where both n and p go to inhnity. See for example Baik 


et al. (2005); Bai (1999); Paul (2007); Johnstone and Lu (2009); Onatski (2012); Shen et ah 


(2013) and many related papers. For additional developments and references, see Bai and 


Silverstein (2009). 


Most of studies focus on the situations where signals are weak or semi-weak (Onatski 


2012) with leading asymptotic eigenvalues bounded (Pauh 2007 Bai and Silverstein, 2009) 


or slowly growing (Onatski, 2012| ). However, Fan et al. (2013) shows that for factor models 
with pervasive factors, the leading eigenvalues can grow linearly with the dimensionality and 
hence their corresponding eigenvectors can be consistently estimated as long as sample size 
diverges. This leads to the question of how the asymptotics of engen-structure depends on 
the interplay of spike magnitude of leading eigenvalues, dimensionality, and sample size. An 
interesting study on this topic is Shen et al. (2013), which focuses only on the consistency of 
the problem. The question then arises naturally on the rates of convergence and asymptotic 
structures of empirical eigenvalues and eigenvectors. This is the subject of this study. 

In this paper, we consider a high dimensional spiked covariance model with the hrst 
several eigenvalues significantly larger than the rest. Typically, the spike part is of importance 
and of interest. We provides new understanding on how the spiked empirical eigenvalues and 
eigenvectors fluctuate around their theoretical counterparts and what their asymptotic biases 
are. For the spiked covariance model, three quantities play an essential role in determining 
the asymptotic behaviors of empirical eigen-structure: the sample size n, the dimension p, 
and the magnitude of leading eigenvalues Theoretical properties of PCA have been 

investigated from three different perspectives. 

The hrst angle is through a low-rank plus sparse decomposition, where the covariance 
matrix is perceived as the sum of a low-rank and a sparse matrix. The low-rank part con- 
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tributes to the signal to be recovered whereas the sparse part serves as noise. For example 
in Fan et ah (2008), the low-rank matrix corresponds to the dependence induced by the 


common factors or covariates whereas the sparse matrix corresponds to the idiosyncratic 
noise. In noiseless setting, Candes et ah (2011) considered the principal component pursuit 
and showed that it can recover the decomposition structure under the incoherence condition. 


Chandrasekaran et ah (2011a) also studied the sufficient condition for exact recovery of the 


low-rank and sparse matrices. The noisy decomposition recover was considered more thor¬ 


oughly by Agarwal et al. ( 2012 ). In addition, a large amount of literature has contributed 


to the topic of sparse PCA, for example Amini and Wainwright (2008); Vu and Lei (2012); 


Birnbaum et al. (2013); Berthet and Rigollet (2013); Ma (2013), which leverages the extra 


assumption on the sparsity of eigenvectors. Specihcally, Cai et al. (2013b) studied the mini¬ 
max optimal rates for estimating eigenvalues and eigenvectors of spiked covariance matrices 
with jointly fc-sparse eigenvectors. This type of work assumes bounded eigenvalues, which 
limit the signals we can get from the data. Correspondingly, those works require additional 
eigenvector structure to reduce the possibility of noise accumulation such as incoherence or 
jointly fc-sparse or other similar conditions. In this paper, thanks to the diverging eigenvalue 
regime we will consider, our conclusions will not rely on additional structure of eigenvectors, 
which can be hard to verify in practice. 

A different line of efforts is to analyze PCA through random matrix theories, where it is 
typically assumed p/n —)■ 7 G (0, 00 ) with bounded spike sizes. It is well known that if the 
true covariance matrix is identity, the empirical spectral distribution converges almost surely 

[1999 ) and when 7 < 1 the largest and smallest 


to the Marcenko-Pastur distribution 


eigenvalues converge almost surely to (1- 1 -^/ 7 )^ and (1 — .^)^ respectively (Bai and Yin 


1993 


Johnstone, 2001). If the true covariance structure takes the form of a spiked matrix, Baik 


et al. (2005) showed that the asymptotic distribution of the empirical eigenvalues exhibit 


an scaling when the eigenvalue lies below a threshold 1 -|- ^ 7 , and an scaling 

when it is above the threshold. For the case where we have the regular scaling, Paul (2007) 
investigated the asymptotic behavior of the corresponding empirical eigenvectors and showed 
that the major part of an eigenvector which corresponds to the spiked eigenvalues is normally 
distributed with regular scaling The convergence of principal component scores under 
this regime was considered by Lee et al. (2010). The same random matrix regime has also 


been considered by Onatski (2012) in studying the principal component estimator for high¬ 
dimensional factor models. More recently, Koltchinskii and Lounici ( 2014b|[a ) revealed a 
profound link of concentration bounds of empirical eigen-struecture with the effective rank 
dehned as f = tr(S)/Ai (Vershynin, 2010). Their results extend the regime of bounded 
eigenvalues to more general setting, although the asymptotic results in most cases still rely 
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on the assumption f = o{n). In this paper, we consider the regime p/{nXi) < oo, which 
implies f = 0(n). More discussions will be given in Section]^ 

Deviating from the classical random matrix and sparse PCA literature, we consider the 
ultra-high dimensional regime allowing p/n —)■ cx). If p/n —>■ oo, to ensure sufficiently strong 
signal for PCA, it is natural to also have the spike sizes go to inhnity, namely, Xj — )■ oo 
for the hrst m leading eigenvalues. This leads to the third perspective for understanding 
PCA from this ultra high dimensional setting. Shen et al. (2013) adopted this point of 
view and considered the regime of p/{nXj) — >■ Cj where 0 < Cj < oo for leading eigenvalues. 
This is more general than the bounded eigenvalue condition. Specihcally if eigenvalues are 
bounded, we require the ratio p/n converges to a bounded constant. On the other hand, if 
the dimension is much larger than the sample size, we offset the dimensionality by assuming 
increased signals. In particular, the pervasive factor model considered in economics and 
hnance factor model corresponds to Cj = 0 with the pervasive leading eigenvalues Xj x p. 


see for example Fan et al. (2013, 2014); Stock and Watson (2002); Bai (2003); Bai and Ng 


(2002). The weak factor model considered by Onatski (2012) also implies Cj = 0, with p/n 
bounded and Xj x p^ for some 6 G (0,1). Hall et al. (2005); Jung and Marron (2009) started 


the research of high dimension low sample size (HDLSS) regime. With n hxed, Jung and 


Marron (2009) concluded that consistency of leading eigenvalues and eigenvectors is granted 


if Xj X p® for 9 > 1, which also corresponds to Cj = 0. Shen et al. (2013) revealed an 
interesting fact that when Cj ^ 0, spiked sample eigenvalues almost surely converges to a 
biased quantity of the true eigenvalues; furthermore the corresponding sample eigenvectors 
show an asymptotic conical structure. We will consider the same regime as theirs, but focus 
more on the asymptotic distributions of the eigen-structure, which was not covered in their 
paper, and under more relaxed conditions. Our results can be seen as a natural extension of 


Paul (2007) to ultra high dimensional setting. 


In addition to the different regimes we take on, we also introduce a simple technique to for 
our technical proofs. The idea is to flip the roles of rows and columns and treat p as the sample 
size and n as the dimension. When p is higher than n, sample covariance is clearly degenerate. 
Switching the roles of n and p allows us to utilize the existing results on eigen-structures. 
To be specihc, if we have n samples generated from A^(0, D) where D = diag((ii,..., dp) is 
diagonal, then all the information we have is just an n by p data matrix with independent 
entries. We can simply treat the data as p independent vectors of dimension n each with 
distribution N{0,diln)- Even when the data are not normally distributed and hence p n- 
dimensional vectors are then not independent, the idea is still powerful and leads to better 
understanding of relationship between high and low dimensionality. The simple trick has 


been used to derive asymptotic results of empirical eigenvalues in recent papers such as Shen 














































et al. (2013); Yata and Aoshima (2012, 2013). One of our contributions lies in successful 


application of the trick to study the empirical leading eigenvectors. 

The rest of the paper is organized as follows. Section introduces the notations, as¬ 
sumptions, and an important fact which serves as basis of our proofs. The fact will help 


unravel the relationship between high and low dimensions. Sections 3.1 and 3.2 devote to 
the theoretical results of the sample eigenvalues and eigenvectors of the spiked covariance 
matrix under our asymptotic regime. In Section we discuss several applications of the 
theories in the previous section. Firstly a new covariance estimator for the approximate 
factor model, named shrinkage principal orthogonal complement thresholding (S-POET), is 
proposed which corrects the biases of empirical eigenvalues. Secondly, S-POET will be suc¬ 
cessfully applied to outstanding problems in estimation of risks of large portfolios and false 
discovery proportions for dependent test statistics. For both problems, the typical assump¬ 
tion on the signal strength of leading eigenvalues in order to deploy factor analysis is relaxed 
due to our new results in Section In Section simulations are conducted to illustrate the 
theoretical results at the hnite sample. The proofs for Section are provided in Section 
and those for Section]^ are relegated to the supplementary material. 


2 Assumptions and a simple fact 


Asssume that {YJILi is a sequence of i.i.d. random variables with zero mean and 
covariance matrix Spxp- Let Ai,..., Ap be the eigenvalues of S in descending order. We 
consider the spiked covariance model as follows. 


Assumption 2.1. Ai > A2 > ■ ■ ■ > Am > Am+i > ■ • ■ > Ap > 0, where the non-spiked 
eigenvalues are bounded, i.e. Cq < Aj < Co,j > m for constants Co,Co > 0 and the spiked 
eigenvalues are well separated, i.e. 35o > 0 such that minj<m(Aj — Aj+i)/Aj > 60. 


The eigenvalues are divided into the spiked ones and bounded non-spiked ones. We do 
not have specihc order assumptions on the leading eigenvalues nor require them to diverge. 
Thus, our results in Section are applicable to both bounded and diverging leading eigen¬ 
values; if diverging, they can have different diverging rates. For simplicity, we only consider 
distinguishable eigenvalues (multiplicity 1) for the largest m eigenvalues and a hxed number 
m, independent of n and p. 

The spiked covariance model is motivated by the factor model y = Bf -|- e considered 
by Fan et al. (2013) as follows. Assume without loss of generality that var(f) = Im, the 
m X m identity matrix. Then, the model implied covariance matrix S = BB' -|- Sg, where 
Sg = var(£). If the factor loadings {bj} (the transpose of rows of B) are an i.i.d. sample 
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from a population with mean zero and covariance S;,, then by the law of large numbers, 
bib' —)■ Sfe. In other words, the eigenvalues of BB' are approximately 


pAi(Sb)(l + o(l)), • • ■ ,pAm(Sfe)(l + o(l)), 0, • • • ,0, 

where Aj(Sft) is the eigenvalue of S;,. If we assume that IIS^H is bounded, then by Weyl’s 
theorem, we conclude that 


\j = pAj(Sfe)(l + o(l)), for j = 1, ■ ■ ■ ,m. 


( 2 . 1 ) 


and the remaining is bounded. 

In the spiked covariance models, three essential factors come into play: the sample size n, 
dimension p and the spikeness A^-’s. The following relationship is assumed as in 

Assumption 2.2. Assume p > n. For the spiked part 1 < j < m, Cj = p/{nXj) is bounded, 
and for the non-spiked part, {p — m)~^ Yl^=m+i ~ ^ + o{n~^^^). 

We allow p/n —)■ oo in any manner, though Xj also needs also grow fast enough to ensure 
bounded Cj. In particular, Cj = o(l) is allowed as in the factor model. We do not assume 
the non-spiked eigenvalues are identical, as in most spiked covariance model literature (e.g. 

By spectral decomposition, S = FAT', where the orthonormal matrix F is constructed 
by the eigenvectors of S and A = diag(Ai,..., Ap). Let Xj = F'Yj. Since the empirical 
eigenvalues are invariant and the empirical eigenvectors are equivariant under an orthonormal 
transformation, we focus the analysis on the transformed domain of Xj and the results can 
be translated into the original data. Note that var(Xj) = A. Let Zj = A ^'^^Xj be the 
elementwise standardized random vector. 

Assumption 2.3. {Zj}(L^ are i.i.d copies of Z. The standardized random vector Z = 
(Zi, ..., Zp) is sub-Gaussian with independent entries of mean zero and variance one. The 
sub-Gaussian norms of all components are uniformly bounded: maxj ||.^j||p 2 ^ Cq, where 

Since Var(Xj) = diag(Ai, A2 ,..., Ap), the hrst m population eigenvectors are simply unit 
vectors ei, 62 ,..., 6^- Denote the n by p transformed data matrix by X = (Xi, X2 ,..., X„)'. 
Then the sample covariance matrix is 

1 1 ” 

Spxp = -x'x = - Vx,x', 

n n 


Paul (2007); Johnstone and Lu (2009 


Shen et ah 
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whose eigenvalues are denoted as Ai,A 2 ,...,Ap (A^ = 0 for j > n) with corresponding 

. . . . (V) 

eigenvectors ^21 ■ ■ ■ i ^p- Note that the empirical eigenvectors of data Yds are c- = 

Let Zj be the column of the standardized X. Then each Zj has i.i.d sub-Gaussian 
entries with zero mean and unit variance. Exchanging the role of rows and columns, we get 
the n by n Gram matrix 

1 1 ^ 

E„xn = -XX' = -VA,Z,Z' 
n n ^ 

i=i 

with the same nonzero eigenvalues Ai, A2 ,..., A„ as S and the corresponding eigenvectors 
Ui, U2 ,..., u„. It is well known that for i = 1, 2,..., n 

= (nAi)“^^^X'ui and u* = (nA^'^^^X^^, (2.2) 


while the other eigenvectors of S constitute a (p — n)-dimensional orthogonal complement 

of ^1,. ■ ■ 


By using this simple fact, for the specihc case with cq = Gq = 1 in Assumption 2.1 


A,- = 1 for j > m in Assumption 2.2, and Gaussian data in Assumption 2.3, Shen et ah 


(2013) showed that 


and 


a-.s. , 1 ^ ^ 

-- yl + Cj, l<j<m; 

Ai 


(G,e,) n-(l + c,)-s 


where (a, b) denotes the inner product of two vectors. However, they fail to establish any 
results on convergence rates or asymptotic distributions of the empirical eigen-structure. 
This motivates the current paper. 

The aim of this paper is to establish the asymptotic normality of the empirical eigenvalues 


and eigenvectors under more relaxed conditions. Our results are a natural extension of Paul 


(2007) to more general setting with new insights, where the asymptotic normality of sample 
eigenvectors is derived using complicated random matrix techniques for Gaussian data under 
the regime of p/n —)■ 7 G [0,1). Gompared to them, our proof, based on the relationship 
(2.2), is much simpler and insightful for understanding the behavior of ultra high dimensional 
PGA. 


Here are some notations that we will use in the paper. For a general matrix M, we denote 
its matrix entry-wise max norm as ||M||inax = and dehne the quantities ||M|| = 

AmiL(M'M), IlMlIiT = ||M||oo = max* |Mij| to be its spectral, Frobenius 

and induced £00 norms. If M is symmetric, we dehne Aj(M) to be the largest eigenvalue 
of M and Ainax(M); •^min(M) to be the maximal and minimal eigenvalues respectively. We 
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denote tr(M) as the trace of M. For any vector v, its £2 norm is represented by ||v|| while 
ii norm is written as ||v||i. We use diag(v) to denote the diagonal matrix with the same 
diagonal entries as v. For two random vectors a, b of the same length, we say a = b + Op(5) 
if ||a — b|| = Op{6) and a = b + op{S) if ||a — b|| = op{6). We denote a 4» £ for some 
distribution C if there exists b ~ £ such that a = b + op(l). In the following, C is a generic 
constant that may differ from line to line. 


3 Asymptotic behavior of empirical eigen-structure 


3.1 Asymptotic normality of empirical eigenvalues 

Let us hrst study the behavior of the hrst m empirical eigenvalues of S. Denote by Aj(A) 
the largest eigenvalue of matrix A and recall that Xj = Aj(S). We have the following 
asymptotic normality of Xj. 


Theorem 3.1. Under Assumptions 
tions. In addition, 


2.1 


2.3, {Xj}Y^i’s have independent limiting distribu- 


\/n|^ - ^1 + ccj + Op(\j Vp/^))} ^ N{0,Kj - 1), 


(3.1) 


where Kj is the kurtosis of Xj 


The theorem shows that the bias of Xj/Xj is cCj + Op{Xj ^\/p/n). The second term is 
dominated by the hrst term since p > n and it is of order op(n“^/^) if y/p = o{Xj). The latter 


assumption is satished by the strong factor model in Fan et ah (2013) and a part of weak 


factor model in Onatski (2012). To get the asymptotically unbiased estimate, it requires 


Cj = p/{nXj) —?■ 0 for j < m. This result is more general than that of Shen et ah (2013) 

( |2014b||at 


and sheds a similar light to that of 


Koltchinskii and Lounici 


i.e. 


IS - S|| ^ 0 


almost surely if and only if the effective rank r = tr(S)/Ai is of order o{n), which is true 


when Cl = o(l). Yata and Aoshima (2012, 2013) employed a similar technical trick and gave 
a comprehensive study on the asymptotic consistency and distributions of the eigenvalues. 
They got various similar results under different conditions from ours. Our framework is more 


general and bias reduction can also be made by using a different method; see Section 4.2 In 


addition, under the typical spiked covariance model as in Baik et al. (2005), Johnstone and 


Lu (2009) and Paul (2007), where it is assumed Xj = cq = Co,j > m, we have c = cq equal 


to the minimum eigenvalue of the population covariance matrix. The theorem reveals the 
bias is controlled at the rate p/{nXj). Our result is also consistent with Anderson (1963)’s 






































result that 


n(A,-A,j =^iV(0,2A;), 

for Gaussian distributions and fixed p and A^’s, where the non-spiked part does not exist 
and thus the bias Op{X~^^^/p/n) disappears. The proof is relegated to Section 6 


3.2 Behavior of empirical eigenvectors 

Let us consider the asymptotic distribution of the empirical eigenvectors correspond¬ 
ing to Xj, j = 1,2 ,... ,m. As in Paul (2007), each is divided into two parts corresponding 
to the spike and non-spike components, i.e. = {kjAihjB)' where is of length m. 


Theorem 3.2. Under Assumptions 2U_ - 2^, we have 
(i) For the spike part, ifm = l, 

2(1 + cci) j— f 


CCi 


- 1 + Op(y^^)) =4 77(0, - 1), (3.2) 


while if m > 1, 


sjA 


— ejA + 0} 


sjAl 


p 

nX^^ 


Am(0, Sj) , 


(3.3) 


for j = 1,2 ,... ,m, with 




where [m] = {1, • • • ,m}, e^^ is the first m elements of unit vector e^, and Ojk = 
hmAj,Afe^oo \/^j^k/{Xj — Xk), which is assumed to exist. 

(a) For the noise part, if we further assume the data is Gaussian, there exists p — m dimen¬ 
sional vector ho such that 


Do42^-ho 


11^ 


jB\ 


= Of 


p) Unif(^Bp_rn{l)^ , (3.4) 


where Dq = diag{ -y/c/Am+i, • • •, \/c/Xp) is a diagonal scaling matrix and Unif{Bk{r)) denotes 
the uniform distribution over the centered sphere of radius r. In addition, the max norm of 
satisfies 

/ X 


^jsllmax = Oph/{nXf^) + J log p/{nXj 


(Hi) Furthermore, ||^jvi|| = (1-1 -ccj) -|- Op{Xj ^\/pln + p/{n^^‘^Xj)) and 

= (p^^)^'^^ + Op{^yl/Xj + ^/pJ{n?Xj)). Together with (i), this implies the inner 
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product between empirical eigenvector and the population one converges to (1 + ccj) in 
probability and 

- / ^ _ = Op(^\-^^/y^ + p/{n^/^\j) \ +Op{n-^)I{m > 1). (3.6) 

W i + CCj ^ ' 


In the above theory, we assume that ajk = hmAj.,A 


.j.Afc^oo Aj-Afe 


exists. This is not restrictive 


if eigenvalues are well separated i.e. minj^fc<m |Aj — Afc|/Aj > 5o from assumption 2.1 


The assumption obviously holds for the pervasive factor model (Fan et ah, 2013), in which 

O'jk = \/ Aj(Sfe)Afc(Sb)/(Afc(Sfe) — Aj(Sfe)). 


Theorem |3.2| is an extension of random matrix results into ultra high dimensional regime. 
Its proof sheds light on how to use the smaller n x n matrix S as a tool to understand the 
behavior of the larger p x p covariance matrix S. Specifically, we start from Suj = AjUj 


or identity (6.3) and then use the simple fact (2.2) to get a relationship (6.4) of eigenvector 


Then (6.4) is rearranged as (6.5) which gives a clear separation of dominating term. 


that is asymptotically normal, and error term. This makes the whole proof much simpler 


in comparison with Paul (2007) who showed a similar type of results through a complicated 


representation of and Xj. From this simple trick, we can understand deeply how some 
important high and low dimensional quantities link together and differ from each other. 

Several remarks are in order. Firstly, since = F^^- is the empirical eigenvector 
based on observed data Y, we have decomposition 




sjB ) 


where F = (F^, F^). Note that TA$,jA converges to the true eigenvector deflated by a factor 


of 


+ cCj with the convergence rate Op{^Jp/ (nA^) +p/(n^/^Aj) + n while 






creates a random bias, which is distributed uniformly on an ellipse of {p — m) dimension and 
projected into the p dimensional space spaned by F^. The two parts intertwined in such a 
way that correction for the bias of estimating eigenvectors is almost impossible. More details 
are discussed in Section]^ for factor models. Secondly, it is clearly as in the eigenvalue case. 


the bias term Xj^^fpjn in Theorem 3.2 (i) disappears when ^ = o{Xj). In particular. 


lY) 

for the stronger factor given by (2.1), ^ - is a consistent estimator. Thirdly, the situations 


m = 1 and m > 1 have slight difference in that multiple spikes could interact with each other. 
Especially this reflects in the convergence of angle of empirical eigenvector to its population 
counterpart: the angle converges to (1 + ccj)~^^‘^ with an extra rate Op{l/n) which stems 
from estimating ^jk for j ^ k < m (see proof of Theorem 3.2 (iii)). The difference will only 


be seen when the spike magnitude is higher than the order y^pnVpn We will verify this 
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by a simple simulation in Section Fourthly, it is the hrst time that the max norm bound 
of the non-spiked part was derived. This bound will be useful for analyzing factor models in 
Section HI 


Theorem 3.2 again implies the results of Shen et ah (2013). It also generalizes the 


asymptotic distribution of non-spiked part from pure orthogonal invariant case of Paul (2007) 


to more general bounded setting. In particular, when p/n —>■ oo, the asymptotic distribution 
of the normalized non-spiked component is not uniform over a sphere any more, but over 
an ellipse. In addition, our result can be compared with the low dimensional case, where 


Anderson (|1963|) showed that 


NJO, 


A,A 


j^k 


^ (A, -Afc)2 


^k^k 


(3.7) 


for hxed p and Aj’s. Under our assumptions, if the spiked eigenvalues go to inhnity, the 
constants in the asymptotic covariance matrix are replaced by the limits ajkS. Similar to 
the behavior of eigenvalues, the spiked part preserves the normality property except for 
a bias factor 1/(1 -b ccj) caused by the high dimensionality. 


Recent manuscript by Koltchinskii and Lounici (2014a) provides general asymptotic re 


suits for the empirical eigenvectors from a spectral projector point of view, but they mainly 
focus on the regime of p/nXj —)■ 0. Indeed, they limit themselves to the regime that p = o{n) 
and Ai = 0(1) when establishing the asymptotic normality (see conditions for Theorems 5 
and 7 therein). In contrast, we consider a very different regime, requiring p > n and allowing 


Ai to diverge. Furthermore, Theorem 3.2 gives a more rehned description on the behavior of 


empirical eigenvectors than the asymptotic normality result given in Theorem 7 of Koltchin 


skii and Lounici 

(2014a 

). Last but not least, it has been shown by Johnstone and Lu ( 

2009) 


that PCA generates consistent eigenvector estimation if and only if p/n —)■ 0 when the spike 
sizes are hxed. This motivates the study of sparse PCA. We take the spike magnitude of 
eigenvalues into account and provide additional insights by showing that PCA consistently 
estimate eigenvalues and eigenvectors if and only if p/{nXj) —0. This explains why Fan 


et ah (2013) can consistently estimate the eigenvalues and eigenvectors while Johnstone and 


Lu (2009) can not 


4 Applications to factor models 

In this section, we propose a method named Shrinkage Principal Orthogonal complEment 
Thresholding (S-POET) for estimating large covariance matrices induced by the approximate 
factor models. The estimator is based on correction of the bias of empirical eigenvalues as 
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specified in (3.1). We derive for the first time the bound of the relative estimation errors of 
covariance matrices under the spectral norm. The results are then applied to assessing large 
portfolio risk and estimation of false discovery proportion, where the conditions in existing 
literature are relaxed. 


4.1 Approximate factor models 

Factor models have been widely used in various disciplines such as hnance and genomics. 
Consider the approximate factor model 

yu = + Uit, (4.1) 

where yu is the observed data for the {i = 1 ,... ,p) individual (e.g. returns of stocks) or 
components (e.g. expression of genes) at time t = 1,..., T; b is a m x 1 vector of latent 
common factors and bj is the factor loadings for the individuals or components; Uu is the 
idiosyncratic error, uncorrelated with the common factors. In genomics application, t can 
also index individuals or repeated experiments. For simplicity we assume there is no time 
dependency. 

The factor model can be written into a matrix form as follows: 


Y = BF' + U , 


(4.2) 


where Y^xt, Bpxm, F^xm, UpxT are respectively the matrix form of observed data, factor 
loading matrix, factor matrix, and error matrix. For identihability issue, we impose the 
condition that cov(b) = I and B'B is a diagonal matrix. Thus, the covariance matrix is 
given by 

S = BB' + S„, (4.3) 

where is the covariance matrix of the idiosyncratic error at any time t. 

Under the assumption that is sparse with its eigenvalues bounded away 

from zero and inhnity, the population covariance exhibit a “low-rank plus sparse” structure. 
The sparsity is measured by the following quantity 


nip 


max } 

i<p ' ^ 

j<p 


^u,ij 


q 


for some q G [0,1] (Bickel and Levina, 2008). In particular, rup with g = 0 is the maximum 
number of nonzero elements in each row of 


In order to estimate the true covariance matrix with the above factor structure. Fan 
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et al. (2013) proposed a method called “POET” to recover the unknown factor matrix as 


well as the factor loadings. The idea is simply to hrst decompose the sample covariance 
matrix into the spiked part and non-spiked part and estimate them separately. Specihcally, 
let S = T~^YY' and {A^} and be its corresponding eigenvalues and eigenvectors. They 
dehne 

m 

i=i 


where is the matrix after applying thresholding method ( |Bickel and Levina 2008) to 

They showed that the above estimation procedure is equivalent to the least square ap¬ 
proach that minimizes 


(B, F) = argmin ||Y — BF'||^ s.t. —F'F = 1^, B'B is diagonal. 

B,F I 


(4.5) 


The columns of F /^/T are the eigenvectors corresponding to the m largest eigenvalues of 
the T X T matrix T“^Y'Y and B = T“^YF. After B and F are estimated, the sample 
covariance of U = Y — BF' can be formed: = T“^UU'. Finally thresholding is applied 


T 


to to generate where 


-T _ 


(T. 




^ = J] 




(4.6) 


Here Sjj(-) is the generalized shrinkage function (Antoniadis and Fan, 2001 Rothman et al. 


2009) and Tij = T{au,iid'u,jjY^‘^ is the entry-dependent threshold. The above adaptive thresh¬ 
old corresponds to applying thresholding with parameter r to the correlation matrix of 
The positive parameter r will be determined later. 

Fan et al. (2013) showed that under Assumptions A.l - A.4 listed in Appendix [A| in the 


supplementary material (Fan and Wang, [2015 ), 

fy/p\ogp 


is"^ - Si 




= Of 


V T 


Aogp 1 \{ 1 - 9)/2 

+ -j 


(4.7) 


where ||A||j:,f = and || . (|f is the Frobenius norm. Note that 

|s,F = P~ 


IS^ - SIR p = - I 


p\\F, 


which measures the relative error in Frobenius norm. A more natural metric is relative error 
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under the operator norm ||A||s = p which can not be obtained by 


using 


the technical device of Fan et ah (2013). Via our new tools, we will establish such a result 


under weaker conditions than their pervasiveness assumption. Note that the relative error 
convergence is particularly meaningful for spiked covariance matrix, as eigenvalues are in 
different scales. 

4.2 Shrinkage POET under relative spectral norm 

The discussion above reveals several drawbacks of POET. First, the spike size has to 
be of order p which rules out relatively weaker factors. Second, it is well known that the 
empirical eigenvalues are inconsistent if the spike eigenvalues do not signihcantly dominate 
the non-spike part. Therefore, proper correction or shrinkage is needed. See a recent paper 


by Donoho et ah (2014) for optimal shrinkage of eigenvalues. 


Regarding to the hrst drawback, we relax the assumption ||p ^B'B — f2o|| = o(l) in 


Assumption A.l to the following weaker assumption. 


Assumption 4.1. ||A^^^^B'BA^^^^ — Poll = o(l) for some Pq with eigenvalues hounded 
from above and below, where Ka = diag{\i,..., \m) ■ In addition, we assume Xm —t oo, 
Xi/Xm is bounded from above and below. 

This assumption does not require the hrst m eigenvalues of S to take on any specihc 
rate. They can still be much smaller than p, although for simplicity we require them to 
diverge and share the same diverging rate. As we assume bounded ||Su||, the assumption 
Am —>■ oo is also imposed to avoid the issue of identihability. When Am does not diverge. 


more sophisticated condition is needed for identihability (Chandrasekaran et ah, 2011b). 


In order to handle the second drawback, we propose the Shrinkage POET (S-POET) 


method. Inspired by (3.1), the shrinkage POET modihes the hrst part in POET estimator 


(4.4) as follows: 


1=1 


(4.8) 


where AJ = max{Aj — cp/7T,,0}, a simple soft thresholding correction. Obviously if Xj is 
sufficiently large, Xj /Xj = Xj/Xj — ccj = 1 -|- op(l). Since c is unknown, a natural estimator 
c is such that the total of the eigenvalues remains unchanged: 


tr(S) = 

1=1 


cp/n) + (p — m)c 


or c = (tr(S) — ^j)/iP — nr — pm/n). It has been shown by Lemma 7 of 


Yata and 

















Aoshima (2012) that 


cp 


nXj {n — m)Xm nX 


= Op{n ). 


Thus, replacing c by c, we have Xj /Xj — 1 = Op{X~^^^/p/n + i.e. the estimation 

error in c is negligible. From Theorem |3.1 we can easily obtain asymptotic normality, that 
is ■\/n{Xj /Xj — 1) 4> A^(0, Kj — 1) if y/p = o(Aj). 

To get the convergence of relative errors under the operator norm, we also need the 
following additional assumptions: 

Assumption 4.2. (i) {ut,ft}f>i are independently and identically distributed with E[uit] = 
K[uitfjt\ = 0 for all i < p, j < rn and t <T. 

(a) There exist positive constants Ci and C 2 such that Amin(S„) > Ci, ||S„||oo < C2, and 
minij Var{uitUjt) > Ci. 

(Hi) There exist positive constants ri,r2, bi and 62 such that for s > 0,i < p, j < m, 

P(|Mit| > s) < exp{-{s/biY^) and P(|/jf| > s) < exp{-{s/b 2 YY ■ 


(iv) There exists M > 0 such that for all i < p, j < m, |6y| < M^^Xj/p. 

(v) ^(logT)i/'’2 = o{Xm)- 

The hrst three conditions are common in factor model literature. If we write B = 
(bi,... ,bm), by Weyl’s inequality we have maxi<j<m ||bj|p/Aj < 1 + ||S„||/Aj = 1 + o(l). 
Thus it is reasonable to assume the magnitude \bij\ of factor loadings is of order \fxYJp in 
the fourth condition. The last condition is imposed to ease technical presentation. 

^ 5' 

Now we are ready to investigate ||S — S||s. Suppose the SVD decomposition of S, 


^ I r^xm ^px{p—m) 


A. 




(p—m) X (p—m) 


r' 

Q,' 


Then obviously 


IS -SI 




S-5(fA^r -BB')S- 


1 . - T 


+ lls ^ - s„)s 2 


=: Ar+A, 


(4.9) 


and 


T 


T 


As< ||E-'||||S„ -E„|| <C||E„ -S 


(4.10) 
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It can be shown 


Ar, = 


A-^r' 


1 , (FA r - BB') 
e-^n' 


TA 


n& 


(4,11) 


< Ail 


A 


L2 


where A^i = ||A- 2 r'(rA T - BB')rA" 2 || and Al 2 = ||©' 2 ri'(rA T - BB')ri©-51|. 
Thns in order to hnd the convergence rate of relative spectral norm, we need to consider 
the terms A^i, Ai 2 and As separately. Notice that A^i measnres the relative error of the 
estimated spiked eigenvalnes, Ai 2 reflects the goodness of the estimated eigenvectors, and As 
controls the error of estimating the sparse idiosyncratic covariance matrix. The following 
theorem reveals the rate of each term. Its proof will be provided in Appendix of the 


snpplementary material (Fan and Wang, 2015). 


Theorem 4.1. Under Assumptions \2.1 . 2.2, 2.3, 4-^ o,nd 4-^’ */ plogp > 

max{T(logT)^/'’2,T(log(pT))^/'^i}, we have 


and by applying adaptive thresholding estimator (4-6) with 


Tij — Cut 


, and ut = ^J\ogp/T + a/i7p, 


we have 

As = Op (mpu]u‘^'^ . 

Combining the three terms, ||S — S||s = Op{T~^0 W _|_ mpU^^). 

The relative error convergence in spectral norm characterizes the accnracy of estimation 
for spiked covariance matrix. In contrast with the previons resnlts on Frobenins or max 
norm, this is the hrst time that the relative rate nnder spectral norm is derived. When 
Am X P and g = 0, we have 


1^ ~ 5]||s — Op+ nip 


logP ^ A 
T Vp 


Comparing the rate with (4.7), we see the difference nnder two different norms. The term 


y/plogp/T in (4.7) is enlarged to rate p/T, which is dne to the incoherence of the eigen-spaces 


of the low-rank signal matrix and sparse error matrix. Specihcally this rate comes from Ap 2 - 
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If we care only the relative error of the low-rank and sparse matrix spaces separately, we 
should only emphasize on A^i and A 5 . 

If Cj = 0 ( 1 ), the proposed is asymptotically just the spiked empirical eigenvalue Xj. 
However, when we have semi-weak factors whose corresponding eigenvalues are as weak 
as p/T, shrinkage is necessary to guarantee the convergence of A^i. On the other hand, 
if instead POET is applied to estimate covariance matrix, A^i = Op{p/(XmT) -|- 
which is only bounded. However since the empirical eigenvectors are not corrected, POET 
and S-POET attain the same rate for Al 2 , which actually dominates An and A 5 in high 
dimensional setting. Nevertheless, as to be seen in the simulation studies, S-POET can 
stabilize the estimator and improve the estimation accuracy. For this reason, we recommend 
S-POET in practice. 


4.3 Portfolio risk management 

Portfolio allocation and risk management have been a fundamental problem in hnance 


since 


Markowitz (1952 )’s groundbreaking work on minimizing the volatility of portfolios with 


a given expected return. Specihcally, the risk of a given portfolio with allocation vector w 
is conventionally measured by its variance w'Sw, where S is the volatility (covariance) 
matrix of the returns of underlying assets. To estimate large portfolio’s risks, it needs to 
estimate a large covariance matrix S and factor models are frequently used to reduce the 


dimensionality. This was the idea of Fan et ah (2015) in which they used POET estimator 


to estimate S. However, the basic method for bounding the risk error |w'Sw — w'Sw| in 


their paper as well as another earlier paper of similar topic (Fan et ah, 2012) was 


Iw'Sw — w'Swl < 


w 


They assumed that the gross exposure of the portfolio is bounded, mathematically ||w||i = 
0(1), which made it possible to only focus on the max error norm. Technically, when 
p is large, w'Sw can be small. What an investor cares mostly is the relative risk error 
RE(w) = |w'Sw/w'Sw — 1|. Often w is a data-driven investment strategy, which is a 
random variable itself. Regardless of what w is. 


maxRE(w) = ||S — S||^, 


which does not converge by Theorem |4.1[ The question is what kind of portfolio w will 
make the relative error converge. Decompose w as a linear combination of the eigenvectors 
of S, namely w = (P, fl)r] and 77 = (77'^, 77'^)'. We have the following useful result for risk 
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management. 


Theorem 4.2. Under Assumptions 2.1, 2.2, 4- 1^4-^ ^he factor model (4-1) with Gaus¬ 
sian noises and factors, if there exists Ci > 0 such that llr/^Hi < Ci, and assume Xj oc 
for j = 1,.. . ,m and T > Cp^ for a > 1/2, 0 < < 1, a + /? > 1, then the relative risk error 

is of order 


RE(w) = 


s 

w'S w 


w'Sw 


- 1 


= Op T 


• r 2 (q: + S — l) 1 -1 I 

-I— min4 ^ a —I 1—Q 

1 f) > 2 / rnpWj, ^ 


for a < 1. If a > 1 or \\rj^\\ > C 2 , RE(w) = Op{T + mpw]. 

The condition ||? 7 s||i < Ci is obviously much weaker than ||w||i = 0(l). It does not limit 
the total exposure of investor’s position, but only put constraint on investment of the non- 


spiked section. Note that under the conditions of Theorem 4.2, p/{TXj) —)■ 0, and S-POET 
and POET are approximately the same. The stated result hold for POET too. 


4.4 Estimation of false discovery proportion 

Another important application of the factor model is the estimation of false discovery 
proportion. For simplicity, we assume Gaussian data Xj ~ S) with an unknown 

correlation matrix S and wish to test separately which coordinates of pu are nonvanishing. 
Consider the test statistic Z = where X is the sample mean of all data. Then Z ~ 

S) with pi* = y/npi and the problem is to test 


Hoj ■ U*i — ^ 


vs 


Hij : /ij 7^ 0. 


Dehne the number of discoveries R{t) = 4f{j '■ Pj < 0 number of false discoveries 

V{t) = #{true null : Pj < t}, where Pj is the p-value associated with the test. Note that 
P{t) is observable while V{t) needs to be estimated. The false discovery proportion (FDP) 
is dehned as FDP(t) = V{f)/P{t). 


Recently Fan and Han (2013) proposed to employ the factor structure 


S = BB' + A , 


(4.12) 


where B = • • •, Xj and ^j are respectively the eigenvalue and 

eigenvector of S as before. Then Z can be stochastically decomposed as 


Z = pi* + BW + K , 

where W ~ N{0,lm) are m common factors and K ~ A(0, A) independent of W are the 
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idiosyncratic errors. For simplicity, assume the maximal number of nonzero elements of each 


row of A is bounded. In Fan and Han (2013), they demonstrated that a good approximation 
for FDP(f) is 


FDP^(^) ^ + '/;,)) + ^(ai(zt /2 - rii))]/R(t) , 


(4.13) 


2 = 1 


where Zt /2 is the t/2-quantile of the standard normal distribution, Ui = (1 — ||bj|p)“^/^, 
rji = b'W and b' is the row of B. 

Realized factors W and the loading matrix B are typically unknown. If a generic esti¬ 
mator S is provided, then we are able to estimate B and thus bj from its empirical eigen¬ 
values and eigenvectors A/s and W can be estimated by the least-squares estimate 

W = (B'B)-^B'Z. " ' ' 


Fan and Han 


(2013) proposed the following estimator for FDPA(t): 


FDPu(«) = ^|#(a.:(2,/2 + f/i)) + #(aj(2,/2 - , 


(4,14) 


2=1 


where hj = (1 — ||bi||^) and fji = b'W. The following assumptions are in their paper. 

Assumption 4.3. There exists a constant h > 0 such that (i) R{t)/p > hp~^ for h > 0 and 
9 > 0 as p ^ oo and (ii) di < h,ai < h for alii = 1,... ,p. 

They showed that if S is based on the POET estimator with a spike size Xm x P, under 
Assumptions A. 1| - together with Assumption 4.3 


\FDFu,POET{t) - FDPA(t)| = Op p' 


logp Hr* 


(4.15) 


T Vp 

Again we can relax the assumption of spike magnitude from order p to much weaker Assump¬ 


tion 


4.1 Since S is a correlation matrix, Ai < tr(S) = p. This, together with Assumption 


i3 leads us to consider that all leading eigenvalues are of order proportional to p°‘ for 

1/2 < a < 1. 

^ s 

Now apply the proposed S-POET method to obtain S and use it for FDP estimation. 
Then we have the following theorem. 


Theorem 4.3. If Assumptions \2.1\ \2.^ \4.1\ \4-2\ \4-3\ are applied to Gaussian independent 
data Xj ~ iV(R, S), and Xj oc p" for j = 1,... ,m, T > Cp^ for 1/2 <a<l,0</3< 
1 , a -|- > 1, we have 

\FDPu,sPOET{t) - FDPA{t)\ = Op(p''(||r*||p-Ut-”“^“’^>)) . 


19 


























Distribution of 1st Eigenvalue 


Distribution of 2nd Eigenvalue 


Distribution of 3rd Eigenvalue 



Figure 1: Behaviors of empirical eigenvalues. The empirical distributions of ^^/nj2{Xj/Xj — 
1 — Cj) for j = 1, 2, 3 are compared with their asymptotic distributions A^(0,1). 


Comparing the result with (4.15), this convergence rate attained by S-POET is more 
general than the rate achieved before. The only difference is the second term, which is 
0(T-^/^) if a + 4/3 > 1 and if a + 4/3 < 1. So we relax the condition from a = 1 

in Fan and Han (2013) to a G (1/2,1]. This means a weaker signal than order p is actually 


allowed to obtain a consistent estimate of false discovery proportion. 


5 Simulations 

We conducted some simulations to demonstrate the hnite sample behaviors of empirical 
eigen-structure, the performance of S-POET, and validity of applying it to estimate false 
discovery proportion. 

5.1 Eigen-structure 

In this simulation, we set n = 50, p = 500 and S = diag(50, 20,10,1,..., 1), which 
has three spikes (m = 3) Ai = 50, A 2 = 20, A 3 = 10 and corresponding ci = 0.2, C 2 = 
0.5, C 3 = 1. Data was generated from multivariate Gaussian. The number of simulations is 
1000. The histograms of the standardized empirical eigenvalues ^^/n/2{Xj/Xj — 1 — Cj), and 
their associated asymptotic distributions (standard normal) are plotted in Figure The 
approximations are very good even for this low sample size n = 50. 

Figure 1^ shows the histograms of \Ar(^jA/ll^iAll ~ ^ja) for the hrst three elements (the 
spiked part) of the hrst three eigenvectors. According to the asymptotic result, the values 
in the diagonal position should stochastically converge to 0 as observed. On the other hand, 
plots in the off-diagonal position should converge in distribution to A^( 0 , 1 ) for k ^ j after 
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Distribution of 1st Element of 1st Eigenvector Distribution of 1st Element of 2nd Eigenvector Distribution of 1st Element of 3rd Eigenvector 
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Distribution of 3rd Element of 1st Eigenvector 


Distribution of 3rd Element of 2nd Eigenvector Distribution of 3rd Element of 3rd Eigenvector 
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Figure 2: Behaviors of empirical eigenvectors. The histograms of the elements of the 
empirical vectors are depicted in the location (fc,j) for k,j < 3. Off-diagonal plots of values 
V^^jk/W^jAll /compared to their asymptotic distributions A^(0,1) for k ^ j 

while diagonal plots of values \/^(0i/ll^iAll ~ 1) compared to stochastically 0. 

standardization, which is indeed the case. We also report the correlations between the hrst 
three elements for the three eigenvectors based on those 1000 repetitions in Table The 
correlations are all quite close to 0, which is consistent with the theory. 

For the normalized nonspiked part ^jB/W^jsWy if should be distributed uniformly over 
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Table 1: The correlations between the hrst three elements for each of the three empirical 
eigenvectors based on 1000 repetitions 



1 st & 2 nd elements 

1st & 3rd elements 

2nd & 3rd elements 

1st Eigenvector 

0.00156 

-0.00192 

-0.04112 

2nd Eigenvector 

-0.02318 

-0.00403 

0.01483 

3rd Eigenvector 

-0.02529 

-0.04004 

0.12524 


the nnit sphere. This can be tested by the resnlts of Cai et ah ( |2013a ). For any n data 
points Xi,..., on p-dimensional sphere, define the normalized empirical distribntion of 
angles of each pair of vectors as 

^^n,p = ^yJp-2{-K/2-eij) 1 

\2/ l<2<J<n 

where G [0,7r] is the angle between vectors Xj and X^. When the data are generated 
nniformly from a sphere, ^n,p converges to the standard normal distribntion with probability 
1. Fignrej^shows the empirical distribntions of all pairwise angles of the realized 
{j = 1 , 2 , 3) in 1000 simulations. Since number of such pairwise angels is , the empirical 
distributions and the asymptotic distributions iV(0,l) are almost identical. The normality 
holds even for a small subset of the angles. 

Lastly, we did simulation to verify the rate difference of {^j,ej) for m = 1 and m > 1, 


revealed in Theorem 3.2 (iii). We choose n = [10 x 1.2^] for I = 0,... ,9, p = [n^/100], 
where [■] represents rounding. We set Xj = 1 for j > 3 and consider two situations: (1) 
Ai = p, X 2 = 1, (2) Ai = 2 A 2 = p. Under both cases, simulations were carried out 500 
times and the corresponding angle of empirical eigenvector and truth was calculated for each 
simulation. The logarithm of the median absolute error of (^]^,ei) — 1/\/l + ci was plotted 
against log(n). Under the two situations, the rate of convergence is and Op{n~^) 

respectively. Thus the slope of the curves should be —3/2 for a single spike and —1 for two 
spikes, which is indeed as the case as shown in Figure]^ 

In short, all the simulation results match well with the theoretical results for the ultra 
high dimensional regime. 


5.2 Performance of S-POET 

We demonstrate the effectiveness of S-POET in comparison with the POET. A similar 
setting to the last section was used, i.e. m = 3 and Ci = 0.2, C 2 = 0.5, C 3 = 1. The sample 
size T ranges from 50 to 150 and p = Note that when T = 150, p ~ 1800. The spiked 
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Dist of angles of 1 st Eigenvector (nonspiked part) Dist of angles of 2nd Eigenvector (nonspiked part) Dist of angles of 3rd Eigenvector (nonspiked part) 



-4 -2 0 2 4 -4 -2 0 2 4 -4 -2 0 2 4 


Figure 3: The empirical distributions of all pairwise angles of the 1000 realized 
(j = 1, 2, 3) compared with their asymptotic distributions A^(0,1). 


Convergence of 1-spike vs 2-spike modei 



Figure 4: Difference of converged rate of (^i,ei) — 1/a/1 + ci for a single spike model and 
a two-spike model. The error should be expected to decrease at the rate of and 

Op{n~^) respectively. 

eigenvalues are determined from p/{TXj) = Cj so that Xj is of order \/T, which is much 
smaller than p. For each pair of T and p, the following steps are used to generate observed 
data from the factor model for 200 times. 

(1) Each row of B is simulated from the standard multivariate normal distribution and 
the column is normalized to have norm Xj for j = 1, 2, 3. 

(2) Each row of F is simulated from standard multivariate normal distribution. 

(3) Set = diag((jf,..., <7^) where Uj’s are generated from Gamma(a, /3) with a = 13 = 

100 (mean 1, standard deviation 0.1). The idiosyncratic error U is simulated from 
iV(0,S„). 
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Relative spectral norm error (mean) 


Relative Frobenius norm error (mean) 




Spectral norm error (mean) 


Max norm error (mean) 



T 


T 


Figure 5: Estimation error of covariance matrix under respectively relative spectral, rela¬ 
tive Frobenius, spectral and max norms using S-POET (red), POET (black) and sample 
covariance (blue). 

(4) Compute the observed data Y = BF' + U. 

Both S-POET and POET are applied to estimate the covariance matrix S = BB' -|- 
Su. Their mean estimation errors over 200 simulations, measured in relative spectral norm 
||S — S||^, relative Frobenius norm |S — S||^p,, spectral norm ||S — S|| and max norm 
11^ — S||inax, are reported in Figure q The errors for sample covariance matrix are also 


24 




















Figure 6: Comparison of estimated FDP’s with true values. The left plot assumes knowl¬ 
edge of B, the middle and right ones are corresponding to POET and S-POET methods 
respectively. The results are aligned along the 45-degree line, indicating the accuracy of the 
estimation of FDP. 


depicted for comparison. First notice that no matter in what norm, S-POET uniformly 
outperforms POET and sample covariance. It affirms the claim that shrinkage of spiked 
eigenvalues is necessary to maintain good performance when the spike is not sufficiently 
large. Since the low rank part is not shrunk for POET, its error under the spectral norm is 
comparable and even slightly larger than that of the sample covariance matrix. The error 
under max norm and relative Frobenius norm as expected decreases as T and p increase. 
However the relative error under the spectral norm does not converge: our theory shows it 
should increase in the order p/T = a/T- 


5.3 FDP estimation 


In this section, we report simulation results on FDP estimation by using both POET 
and S-POET. The data are simulated in a similar way as in Section 5.2 with p = 1000 and 
n = 100. The hrst m = 3 eigenvalues have spike size proportional to p/y/n which corresponds 


toQ; = /3 = 2/3in Theorem 4.3 The true FDP is calculated by using FDP(f) = V{t)/R{t) 


with t = 0.01. The approximate FDP, FDP^(t), is calculated as in (4.13) with known B 
but estimated W given by W = (BB')“^B'Z. This FDP^(t) based on a known sample 
covariance matrix serves as a benchmark for our estimated covariance matrix to compare 
with. We employ POET and S-POET to get FDPu^poET{t) and FDPu,SPOET{t)- 


In Figure 


three scatter plots are drawn to compare FDPA(t), FDPp,poErit) and 
FDPuspoErit) with the true FDP(f). The points are basically aligned along the 45 de¬ 
gree line, meaning that all of them are quite close to the true FDP. With the semi-strong 
signal A oc p/\/n, although much weaker than order p, POET accomplishes the task as well 
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as S-POET. Both estimators performs as well as if we know the covariance matrix S, the 
benchmark. 


6 Proofs for Section 


6.1 Proof of Theorem 3.1 


We hrst provide three useful lemmas for the proof. Lemma 6.1 provides non-asymptotic 
upper and lower bound for the eigenvalues of weighted Wishart matrix for sub-Gaussian 
distributions. 


Lemma 6.1. Let Ai,..., A „’s be n independent p dimensional sub-Gaussian random vectors 
with zero mean and identity variance with the sub-Gaussian norms bounded by a constant 
Cq. Then for every t >0, with probability at least 1 — 2exp(— one has 


^ IL ^ fL 

w — max{(5, 5^} < Ap^—^^tCjAjA'^ < Ai tCjAjA'j < tc-I-max{(5, 5^} . 

i=l i=l 


where 6 = C^Jp/n -\- tj^/n for constants C,c > 0, depending on Cq. Here \wi\’s is bounded 
for all i and w = n~^ 

The above lemma is the extension of the classical Davidson-Szarek bound [Theorem 


11.7 of 

Davidson and Szarek 

(2001 

to the weighted sample covariance with sub-Gaussian 

distribution. It was shown by 

Vershynin] ( 

2010 

) that the conclusion holds with Wi = 1 for all 


i. With similar techniques to those developed in Vershynin (2010), we can obtain the above 


lemma for general bounded weights. The details are omitted. 

Now in order to prove the theorem, let us dehne two quantities and treat them separately 
in the following two lemmas. Let 

m p 

A = and B = XjZjZ'j, 

j=l j=m+l 

where Z, is columns of XA“ 2 . Then, 




( 6 . 1 ) 


i=i 
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Lemma 6.2. Under Assumptions 2.1 - 2.3, as n ^ oo, 


Vn(^Xj{A)/Xj - 4 N{0,Kj - 1), forj = 1,... ,m. 


In addition, they are asymptotically independent. 


Proof. Note that = n ^ same eigenvalues as matrix Xj^ = 

Z'Z, where Z is an n x m matrix with i.i.d. rows, which are sub-Gaussian distributed 


n 


-Irv/ 


with mean 0 and variance A/A^. Xk here is only used for normalization. Therefore, we are 


in the low dimensional situation as Theorem 1 of Anderson (1963). The differences here 


are two-fold: on one hand we encounter sub-Gaussian distribution; on the other hand the 


eigenvalues could diverge with different rates of convergence. The result of Anderson| (1963) 
can be extended in both directions. 

Extension from Gaussian to sub-Gaussian is trivial. The only difference is that the 
kurtosis of Gaussian is replaced by that of sub-Gaussian distribution. Extension to diverging 
eigenvalues requires careful scrutiny of Anderson’s original proof. Detailedly, following the 


notations of Anderson (1963) and (2.22) therein, we have 


Hj ■.= - A,) = U„ - - \,W„), 


th 


where Ujj = - Xj), Mjj = ^/^(Afc + n sTTfc), Wjj = J2k^j ^jk and Fjk is the j 

element of the eigenvector of A multiplied by ^/n for j ^ k. We claim Mjj/Xj and Wjj 
are bounded with high probability, so Hj/Xj and Ujj/Xj share the same limiting distribution. 
Therefore the limiting distribution of Aj(A) for j = 1,..., m are independent and 

VS(A,(A)/A, - 1) = yS(A,(A)/A, - 1) 4 . /V(0, (k, - 1)A|/A|). 

So the lemma follows. 

It remains to show Mjj/Xj and Wjj are Op(l). Following the cofactor expansion argument 
of Section 7 in Anderson (1963), it is not hard to see F,k = Op{^XjXk/{Xj-XkY). By 


assumption, \Xj — Afc| > 5o max{Aj, A^}. Hence Fjk = Op{1/5q) = Op(l) and so is Wjj. In 
addition, Mjj/Xj = Op{J2k^j Xl/{Xj — A^)^) = Opim/d/f) = Op(l). Now the derivation is 
complete. □ 


Lemma 6.3. Under Assumptions 2.1 - 2.t^ for j = 1, ■ ■ ■ ,m, we have 


Xk{B,)/Xj = ccj + Op{xA.Jl^fJft^ + op{n 2 ), for k = 1,2,... ,n. 


Proof. By dehnition of B, B = n ^ZpApZ'^ where Zp is n x (p — m) random matrix with 
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independent sub-Gaussian entries of zero mean and unit variance and is the diagonal 
matrix with entries Am+i, ■ ■ ■ , Ap. By Lemma 6.1 with t = for any k < n, 


p — m 
Therefore, 


—A,(B) = ^ A, + Op(/)=c + Op(/)+op(n-V2). 


p — m 


j=m-\-l 


Afc(B) _ nAfc(B)p-m 

Xj p — m nXj 


p/ P 


= ccj + Op ( Xj ) + op[Cjn~ 


□ 


Proof of Theorem \3.1\ By Wely’s Theorem, Aj(A) + An(B) < Xj < Aj(A) + Ai(B). Therefore 


from Lemma 6.3 


A, _ A, (A) 


A, A, ' 


+ cCj + OpixO . + op{cjn , 


By Lemma 


6.2 


and Slutsky’s theorem, we conclude that y/n(Xj/Xj — (1 + cCj■ + 


Op{Xj ^ \/p/n) ) j converges in distribution to N{0,Kj — 1) and the limiting distributions 


of the hrst m eigenvalues are independent. 


□ 


6.2 Proofs of Theorem 3.2 


The proof of Theorem |3.2 is mathematically involved. The basic idea for proving part 
(i) is outlined in Section We relegate less important technical lemmas to the end of the 
proof in order not to distract the readers. The proof of part (ii) utilizes the invariance of 
standard Gaussian distribution under orthogonal transformations. 

Proof of Theorem \3.^ (i) Let us start by proving the asymptotic normality of for the 
case m > 1. Write 

X = (Z^Al, ZpA|) = (^Zi,..., 

rri’) \/Xm+l'^m+ly ■ ■ ■ 1 V^Zp) , 


where each Z,- follows a sub-Gaussian distribution with mean 0 and identity variance 


Then by the eigenvalue relationship of equation (2.2), we have 


AlZ'^u^ ^ , ZpAl^. 

^jA —- / anci Uj — 


nXj 


nXj 


+ 




nXi 


n\j 


( 6 . 2 ) 



























Recall Uj is the eigenvector of the matrix S, that is, -XX'u^ = XjUj. Using X = 
1 1 

(Z/iA^, Z^A^), we obtain 




u, = Du, - Au,, 


(6.3) 


where we deno te D = (nXj) ^Z^A^Zg — ccjl„, A = A,/A,- — 


equation (6.3) by \jnXj and employ relationship (6.2 

as follows: 


1 + cCj). We then left-multiply 
to replace Uj by and 




AK^Z^^Z^-IJAI;, , AlZ'^DZ^Al;, 

^jA H ^jA 


X, 


A1Z-^DZ^A| , 

nXi 


nXi 


>jB 


M,a- 


(6.4) 


Further dehne 


^ „ 

ke[m]\j ^ 


X, 


A, A/c 


^kA^kA• 


Then we have R(I — A^/Xj) = 1^ — ^jA^'jA- Note that R is only well dehned if m > 1. 
Therefore, by left multiplying R to equation (6.4), 


A 


^jA)e,A =R( ^ ) 'k( ^ 


A,- 


A 


A \ 2 


A. 


IjA 


r A-z^dz.a| . 


(6.5) 


nXj 


where K = n ^71 a — R + Xj{nXj) ^Z'^DZ^i. Dividing both side by ll^j^ll, we are able to 
write 


where 


»jA 


»jAl 


A 


A \ 2 


A, 


^3A = R[ — ] K( —) ejA + rn, 


A 


A \ 2 


Tn = 


>jA 


»jAl 


A 


.e,A-l)ej.4 + R(T)"K(yA 


R 


A1Z(4DZbA| 


nXi 


A, 


AR 


Aa\W ^ 


A,- 


>jA 


GjA 


>jA\ 


IjAl 


^jA 


IjAl 


■'jA 


( 6 . 6 ) 


(6.7) 


We will show in Lemma 6.4 below that r„ is a smaller order term. By Lemma 6.4 
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noticing that {A.A/^j)^^jA = ^ja, 


sjA 


GjA + 0} 


sjA\ 




( 6 . 8 ) 


Now let us derive normality of the right hand side of (6.8). According to dehnition of R, 

ajkGkAe'kA- (6.9) 

^ fceH\j ^ ^ fee[m]\j 


Let W = ^/n}^ejA = (hhi, • • •, hhm) and be the m — 1 dimension vector without the 

element in W. Since the diagonal element of R is zero, R(AA/Aj)^W depends only 


on Therefore, by Lemma 6.5 below and Slutsky’s theorem. 


VnR^^^y Ke^vi + Oj 


P 

nX] 


Nm(0, 

k^[m]\j 


(^jk^kAGkA 


Together with (6.8), we concludes (3.3) for the case m > 1. 


Now let us turn to the case of m = 1. Since R is not dehned for m = 1, we need to hnd 


a different derivation. Equivalently, (6.3) can be written as 


—ZiZ'^ui H-—ZgA^Z^Ui — T— Ui. 

n nXi Ai 


Left-multiplying u'^ and using relationship (6.2), we obtain easily 


em = 1 


CCi -^1 /TV 1 

—u^Dui = 1 


Ai/Ai Ai 


CCi 


+ Op(Ai^^p7^), 


where D is dehned as before and ||D|| = Op(A^ ^x/p/n) according to Lemma 6.4 Expanding 
\/l — cci/x at the point of (1 4- cci), we have 


^lA = 


+ 


CCi 


a/ 1 + cci 2(1 -4 cci)^A 


Ai/Ai — (1 -|- cci) j -|- Op 


P , -1 

—^ + cin 

nAj 


Note that from Lemmas 


6.2 


and 


6.3 


Ai/Ai — (1 -|- cci) — (||Zi|//n — 1) -|- Op(A;^ ^{p /-|- 
op{cjn~^^'^). Therefore due to the fact v^(||Zi|//?7, — 1) is asymptotically A^(0, ki — 1), we 
conclude 


2(1 + cci)3A ^ 1 

-=- VnUiA - 1^ , _ 

CC\ \ -yl -|- CCi 


+ Oj 


p 

nXf 


N(0,Ki - 1). 
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This completes the first part of the proof. 


(ii) We now prove the conclusion for non-spiked part Recall that Xj fol¬ 

lows X(0,A). Consider Xf = diag(Im,Do)Xj where as defined in the theorem Dq = 
diag(-^c/Am+i, • • ■, \/c/\p). Here the index R means rescaled data by diag(Im,Do)- Af¬ 
ter rescaling, we have Xf ~ iV(0, diaglA^, clp_m)). Correspondingly, the n x p data matrix 

1 1 

X^ = Xdiag(I^,Do) = (Xa,XbDo) where Xa = Z^A^ and X^ = Z^A^ as the notations 


iR 


,R 


-R 


R 


before. Assume ^ • and are eigenvectors given by S and S of the rescaled data X 

R R R '' R 

and ■ It been proved by Paul (2007) that ho := $jB/\\^jB\\ is distributed 

* ^ 

uniformly over the unit sphere and is independent of due to the orthogonal invariance 

^ R /V /V 

of the non-spiked part of Hence it only remains to link $,jB/\\$,jB\\ with ho. 


R 


Note that S = n'^XX' and S = n-^X^X 


so 


= 


-Xb(I 

n 


Dj)x; 


B 


n 


(Aj c)XjX^ 


j=m+l 


where the last term is of order Op ( \/p/n) by Lemma 

R 


and Kahan 


(1970), ||u 


6.1 


Thus by the sin 6 theorem of 


Davis 


ut 


= Op{Xj ^\/p/n). Next we convert from to using the 


basic relationship (2.2). We have. 


Dn 




>jr B 


11^ 


jB\ 


11^ 


R 

jBl 


< 


DqX'^Uj 


V^j 
=:/ + //. 


DqX'^Uj 


nAj III 


A, 


A,ll^ 


jB\ 


DoXr'u? 


3\\^jB\ 


riAfll^ 


R 

jBl 


A. 


Aflll^BlP 


+ 


DoX'^ 


nXfU%\\ 


u,- 


u 


R\ 


First we claim II = Op{Xj ^\/p/n) since ||uj — uj^|| = Op{Xj ^\/p/n), ||X'^/iynA^|| = 
Op(vTj), Xj/Xf = Op(l) and l/|||jp|| = Op(l/y/cj) according to Lemma 6.6 Now we 
show I = Op{\Jn/p) -|- op{n~^/‘^). From the proof of Lemma 


6.6 


we have 


AjIlIjsilVAi = ccj + Op(A. ^\/pJn) + op{cjn 


Then some elementary calculation gives the rate of I. Therefore, ||Do^ 


jB/ IISjBl 


Op{\/n/p) + op{n The conclusion (3.4) follows. 


To prove the max norm bound (3.5) of |||jp||max, we first show ||ho||max = Op(v^log p/p). 
Recall that ho is uniformly distributed on unit sphere of dimension p — m. This follows 
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easily from its normal representation. Let G to be p — m dimensional multivariate standard 
normal distributed, then ho = G/||G||. It then follows 


|ho||max= max |Gi|/||G|| = Op{^/\og^). 

i<p—m 


From the derivation above, 


I 11 max ^ 




\ hllhsi 


|Do‘l 




011 max I 5 


which gives Op/{n\^) + \/\ogp/p)) = Op{p/+ ^y\ogp/{nXj)), given the fact 

that W^jbW = Lemma 6.6 Thus we are done with the second part of the proof. 

(hi) The proof for the convergence of ||^jvi|| and given in Lemma 

the result for ||^jvi|| directly gives (3.6) with the same rate. For m > 1, from 


6.6 


6.6 


If m = 1 , 


we have 


ll^i^ir = (1 + ccj) ^ + Op(^^p/{n\)) + CjH . 


On the other hand, from Theorem 3.2 (i), = Op{p/{nX^) + 1/n) for fc < m and k ^ j. 


So = (1 + cCj) ^ + Op{^p/{nXj) + CjU + 1/n), which implies (3.6). 


□ 


Lemma 6.4. As n ^ oo, ||r„|| = Op{Xj ^\/p/n + 1/n). 

Proof. Dehne = ijA/UjA\\ “ {ijA/UjAh(^jA)ejA and a^, and as follows: 


Oj 


f^3 = 


Ri^yKe.A 








+ AIIRI 


7i 


R 


xiz'^mpXl 


nXi 


11 ^ 


jA\ 


We claim that 7 j = Op(A Ay/p/n) and Oj,/3j, ||vj|| = Op{X,^^/p/n + n 2 ). Then the rate of 


could be easily derived from its definition (6.7) and the above results. To be specihc, hrst 


notice the following two inequalities: by (6.5), ||vj|| < (]j + 7 ^; by orthogonal decomposition 


^ja/UjaW = {^jA/UjA\\,ejA)ejA + Vj, we have 1 - {^jA/UjAh^jA) = 1 - “ llwll^ ^ 
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||vj|p. Note that we always choose ^ so that is positive. Therefore 

^jA 


GjA 


>jA\ 




>iAl 


( 6 . 10 ) 


< l|vil|(l + ||vj||) < ||vj||(l + /?j+7j) 


Hence, by (6.7), 




sjA 


II^.aII 


-jA 


+ 7i 


< IIV, I 


‘'ill + /^i(l + /^i + 7i)) + 7i = Op{Xj + 1/n). 

It remains to show the claims above. Let us first show the rate of convergence of 'jj. In 


order to prove this, we need the rate of ||D||. By Lemma 6.1, ||(p — m) — cl|| = 

so we have 


IDI 


< 


II 1 Z^A^Z'^ 
n Xj 
p — m 1 
nXj p — m 


— CCjlr 


ZbAbZ'^ - cl|| + c|^| = Op(A - ^\/p/n ). 


nX 


Hence 


7i < 


RAl 


A, 


Z'a 

||D|| 

ZpAl 



A, 


a/u 


^nXj 


11^ 


= OpiXj ^ . 


since the other terms except ||D|| are all Op{l). Indeed, (6.9) says the hrst term is asymp¬ 
totically bounded. We have shown in the proofs of Lemmas |6.2| and |6.3| that the second, 
third and hfth terms are Op(l). In addition, the facts that ||^jp|| < 1, II^jaII (1 + ccj )“2 
imply the last term is Op(l). 

Then let us show that aj and are Op{X~^^p/n -|- n~^). The rate of ||K|| is needed. 
By Lemma 


6.1 


WTJj^a - I|| = Op{^m/n) = Op{n 2 ). Thus, 


IKII = 


-Z(4Za-I„ + ^-Z(4DZ^ 
n X-n 



1 , 

, A,-. 

1 , 

< 

~Z^Z^ — \n 

+ D 

-z^z^ 


n 


n 


= Of 


P -1 

-|- n 2 


nX] 


Then easily we get aj = Op(A - ^\/p/n + n 2 ). Note that from Theorem 


3.1 


that A = 
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Aj/Aj - (1 + ccj) = Op{Xj ^\/p/n + n 2 ), so 


< ||K||||RAa/Aj|| + A||R|| = Op{Xj ^\/p/n + n 2 ), 


where similar to (6.9), ||RA^/Aj|| and ||R|| are Op(l). 

Finally, ||vj|| < (3j + 7 ^ = Op{X~^\/p/n + n~^). The proof is complete. 

Lemma 6.5. + Op{X-J^,/pfn) A A(0,I„,_i). 

Proof. Recall W = ^/nl^ejA■ Then, by the dehnition of K, 


W 


—-jzPZlPZj — \fnG.jA H— A —^ZADZj . 


□ 


Its component is Wt = n + bnt for t G [m] \j where 6nt = (Aj/Aj) ■ n ^Az'DZ^. 

Denote W = {n-^/‘^Z'fLj)t^yrn]\j and 5n = {3nt)t&[m]\j- We claim as n -)■ 00 , ||5n|| = 
Op{xfApA)- So w(-^) = W + Op{xfAp/ n). In order to prove the lemma, it snf- 
fices to show that W follows A(0,lm-i)- That is, for any vector a of m — 1 dimension, 
E[exp( 2 a'W)] —)■ exp(—||a|p/2) almost snrely. 


E 

gia'W 

= E 

e[ 

[f 

giatZ'Zj/7n 2 ^ 




t£ 

m]\j 




where fj{u) = E[exp(mZfcj)] is the characteristic fnnction of each element of Z^. The sub 
index j means we actually allow different characteristic functions for the columns of Z^ and 

Zb. 

By Taylor expansion, we can easily derive 


— 1 — ix + x^/2| < (|a:|^/6) A x ‘^, 


from which it holds that 


u 


\fj{u) - 1 - iuE[Zkj] + yE[Z^^.]| < u^E 




L 6 


kj 


E 


H I 7 |3 A y2 

g \^kj\ A ^kj 


goes to 0 as M —)■ 0 and is dominated by the integrable function 


So by Dominated Convergence Theorem the right hand side is o{vf). Therefore, fj{u) = 
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1 — V?/2 + o{v?). Using this result, we have 


E 


Va'W 


= E 


n 

[ n n 


is 




1 ^2 


+ 0(1) 


4n 1 


k=l 


2n 


+ 0 ( 1 ) 


nE[i 


k=l 


= 1- 


II II 7^ 
2n 


+ 0 ( 1 ) 


2 \ ^ 


2n 


+ o(l) 14' exp(-||a||V2) 


which implies W follows iV(0,/m-i). 

Now let us validate ||5n|| = Op{X~^^/pjn). Clearly 


\Snt\ + \^j/^j 


n ^ 


IDII . 


We have shown that \Xj/Xj\ = Op{l) and ||D|| = Op{Xj ^\/p/n). It suffices to show 
^Z(Z,=Op(l). 


E 


iz(z,- 

n 


-E 

n 


E 


:z;z,)^iz, 


Ie 

n 


Z'Z, 


= 1 


So by Markov inequality, we have \n ^+Z(Zj| is Op(l), which generates 5nt = Op{Xj ^\/p/n). 
So is ||5„|| since 6n is of hxed length m — 1. The proof is complete. □ 


Lemma 6.6. W^jaW = (1 + ccj) + Op{Xj ^\/p/n + cjn ^+) and 
ll^ifill = + Op{^Jl/Xj + yc“n"+2). 


Proof. If m = 1, Theorem 3.2 (i) directly implies the conclusions. So in the following, we 


only consider m > 1. Recall that X = (Zy^A^, ZpA^). Let Z = (Z^,Zp), then 

Z = XA-5 = , j„)'A-y 

where A = diag(AA, Ap) and A = diag(Ai,..., An). Dehne 

A = diag(l,. .., 1 , Anj+i, ..., Ap) 


and consider the eigenvalue of the matrix n ^ZAZ'. The j-th diagonal element of the matrix 
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must lie in between its minimum and maximum eigenvalues. That is 


AJ-ZAZ')< (-ZAZ') = A^4^<Ai(-ZAZ'), 

' J jj n 

k=l 


n 


where S,jk is the /c-th element of the j-th empirical eigenvector for j < m. Divided by Aj, 


then by Theorem 3.1 and Lemma 6.1 both the left and right hand side converge to 


CC-i 


1 + CCi 


+ Oi 


P 

nX) 


+ CjU 


So /Afc also converges to the above quantity. Also, by dehnition, Xk/Xk = Op{Xk) 

ioT k < m while the ratio is 1 for k > m. By Theorem 


3.2 


i)^e,k = Op{n-^X,Xk/{X,-Xkf) 


for j ^ k < m. Hence, = Sfc=m+i ^i-gain converges to the above quantity, which 

implies the rates of convergence for WijpW and ||^jvi|| = (1 — □ 


A Comparison on assumptions 


The following assumptions are from Fan et ah (2013), where the results were established 
for the mixing sequence. But we only consider i.i.d. data in this paper. The assumptions 


are listed for completeness and comparison with Assumptions 4.1 and 4.2 


Assumption A.l. ||p”^B'B — fioH = o(l) for some m x m symmetric positive definite 
matrix fio such that SIq has m distinct eigenvalues and that Ainin(r 2 o) and Amax(^^o) 
bounded away from both zero and infinity. 


Assumption A.2. (i) {ut,b}t>i is strictly stationary. In addition, ]E[njt] = ¥,[uitfjt] = 0 
for all i < p,j < rn and t <T. 

(a) There exist positive constants ci and C 2 such that Amin(S„) > ci, ||Sm||oo < C 2 , and 
minjj VariuitUjt) > Cp 

(Hi) There exist positive constants ri,r 2 , and 62 such that for s > 0,i < p, j < m, 


Pdwjil > s) < exp{—{s/biY^) and P(|/jd > s) < exp(—(s/ 62 )^^) • 

We introduce the strong mixing conditions. Let nnd denote the cr-algebras 

generated by {(fs,Us) : —00 < s < 0 } and {(fs,Us) : n < s < 00 } respectively. In addition, 
dehne the mixing coefficient 

a{n)= |P(A)P(H)-P(AH)|. 
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Assumption A.3. There exists rs > 0 such that 3r^ ^ + 1.5r2 ^ ^ > 1 and C > 0 

satisfying a{n) < exp(— for all n. 


Note that for the independence case, Assnmption A.3 is trivially satished since a{n) = 0 
for all n. 


Assumption A.4. There exists M > 0 such that for all i < p and s,t <T, 

(i) ||bi||max < M, 

(a) E[p“^/^(u'^ut — Eu'^Ui)]^ < M, 

(iii) E||p“^/2 hiUnW"^ < M. 


B Proofs of Theorems in Section 4 


In order to prove theorems in Section convergence rate of the sparse error matrix 
is reqnired. The following theorem states the convergence rate for estimating by 


the thresholding procedure in (4.6). Its proof and related technical lemmas are given in 
Appendix 

Theorem B.l. Under the assumptions of Theorem E3 by applying adaptive thresholding 


estimator (4-6) with Tij = Ci>jT{c!u,iidPuyf)^^’^ o.nd ut = \Aogp/T + ^/ijp, we have 


T 


|S„ - S«il = Op{ujj, ‘‘mp) 


Given Theorem B.l 


we are 


ready to start showing theorems in Section]^ The proofs 


are 


built based on conclusions in Section |3l 


Proof of Theorem 4-i. We hrst prove the theorem for term A^^i. Write B = (bi,..., b^) and 
the minimizer of (4.5) as B = (bi,..., hm)- Since B is just the eigenvectors (unnormalized) 
of S, we have: 


= ||bd|2 and G = b,- 


Then Af = 


— cp/n or = 


m-ip 


— cp/n if c is unknown. Let A = 


diag(||bi|p,..., ||bmin he the diagonal matrix of the hrst m empirical eigenvalues and 
r = (bi/||bi||,...,bm/||bm||) be the empirical eigenvector matrix. In Sections 


3.1 


and 


3.2, our results for empirical eigenvalues and eigenvectors imply the following: 


IA-'/2^A^ _ a)A-i/2|| ^ Op{X4/^/^ + T-^/^) ; 


(B.l) 


and 


|rr-D|| =Op(A-'vyv + r-'/2), 


(B.2) 
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where D = diag((l + cci) , (1 + ccm) Now let us start to bound A^i and Al 2 - 


Ali < ||A"^/^r'(f A^r - rAr')rA-^/2|| ^ ||A-V2p/(p^r/ _ BB')rA-^/2| 


=: aS + A(^) 


LI 


We handle the two terms separately. 

Ai^\^ < ||A-^/2r'(f A^r - rDADr')rA-^/2|| ^ 


where we used ^ I. The right hand side is further bounded by / + 2II + III with 

I = ||A-^/2(r'f - D)A^(r r - d)a~^/2|| , 

II = ||A-^/2(r'f - D)A^DA-^/2||, III = ||A-^/2D(A^ - A)DA~i/2|| . 


By equations (B.l) and (B.2), we conclude that II and III are of order Op{X^ ^Jp/T+T 
and / is of smaller order. Thus A^^l = Op(T“^/^). In order to derive rate of A^^l, denote 
A = diag(||bi|p,..., ||bmP) and f = (bi/||bi||,... ,bm/||bm||) so that BB' = fAf . We 
could treat A^^l similar to A^\\ A^^l could be bounded by I' + 211' + III' with 


r = ||A-^(rT - i)A(r r - i)a-5||, 

II' = ||A-^(r'f - I)AA-5||, III' = ||A-^(A - a)a-5 


By Weyl’s theorem, \Xj — im||^| < ||Su|| < C, so III' = 0{1/Xm)- By sinO theorem, 

lir'f - III = ||r'(f - r)|| < ||f - r|| < c\\i:j/Xm = o(i/aj , 

(o\ 

so is II'. Since I' is of smaller order, we conclude A^l = 0{1/Xm)- Therefore, Api < 
A™ + aS = Op(T-V2). 

The bound for term Ap 2 is derived in the following. Recall that 

Al2 = ||©■5^i'(fA^r - BB')ri©-i||, 


which is bounded by 




l©-5ri'fAf ri©-5|| =; A^^,^ + A^^ 


^L2 


aS < ||©■i||^iTf ||A“|| =Op{p/T), 


»-l|| IIO'f’l|2 
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because by Lemma 


6.6 




Op{^) = Opi^p/iTX^)). 


AS < 


I© 




Op(l/Am) , 


as ||fi^r|| = ||rr' — rr || = 0(||S„||/Am) = Op(l/Am) by sin^ Theorem. Finally, aS = 
Op{p/T + 1/Am)- 

— S„||, it suffices to bound 
So 


Finally let us look at term A 5 . Since As < ||S’ 
|S^ — S^ll, which has already been done in Theorem 


B.l 


As = Oplrripi 


flogp 1 \{ i - 9)/2 

l^^pJ 


□ 


Proof of Theorem f.2. 


The numerator of the relative risk is bounded by 


w'(fA^r - BB')w| + |w'(S^ - S„)w 


The second term is bounded by ||S„ — S„||||w|p, thus is Op(A 5 ||w|p). By using w = 
(r,ri)r/, the hrst term can be written as 


|(r/'^r' + r/),Fl')E(rr/^ + nrjs)\ = Op{ri'J^'ETrj^ + , 

that 


s^/ 


where E = FA F — BB'. It is easy to see from the proof of Theorem 


r/'^F'EFr/^ = Op(ApiAi||r/^in, 


4.1 


By Theorem 3.2, ||ri'f ||max = max^ ||^jsl|max = Op(p/(TA^^) + ^logp/(nAm)). From 


proof of Theorem 4.1, we know that ||f2'F||max A ||^^^r|| = Op(l/Am)- Therefore, 


rj',,n'Enrj^ < Wri^WiOpiWn'TWLJ^ II + ||f^T||^,J|A||), 


which gives rj'^Cl'Eflr]^ = Op{cl^ + \ogp/T + AJ). 

The denominator is lower bounded by w'Sw > Am||?7yi|P + c||?7p|P- Thus the relative 
risk is of order 


Of 


A 


Ll 


^iWPAf + 4n + '^Ogp/T + A 


-1 

m 


Am||r/^||2 + c||r7B||2 


+ Ac =Op T 




+ rripWrj, 


l-q 


for a < 1 if we plug in the convergence rate of Api and As in Theorem 4.1 If a > 1, the 
relative risk is Op{T~^0 _|_ rnpW^T'^). Note the rate Op{T~‘^0+h-T/h^ comes from in the 
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numerator. If we further assume ||r/^|| > C 2 , this rate becomes c^/\m dominated by T 
thus the relative risk is again of order + rUpW^^). □ 


Proof of Theorem 4-3. The proof follows Theorem 1 of Fan and Han (2013). Using their 
notation, we have 

F^uit) - FDPA(t) = (Ai + A2)/R{t) + , 


where with W = (B'B)-^(B'Z), 

P r ^ 

Ai = [$(a,(;2i/2 + b'W))-<F(a,(^t/2 + b'W)) 


i=l 

P 


Ao — 


- b'W)) - Ha,(z „2 - b'W)) 

i=l 


We just need to bound Ai, then A 2 can be bound similarly. As shown in Fan and Han 


(2013), 


m 

iA.i < c( E ly - Ail +- «iii + v^difii + v^)ii«? - «iii). 


j=l 


S ''Si 

where and ^ ■ are the eigenvalue and eigenvector of S dehned in (4.8). So by Weyl’s 


theorem and Theorem 4.1 


\Xj — Xj\ < ||S — S|| — Op(AiiAi + Ai 2 + As) 


= Of 


( Ai /logp p\ 
vVt^V T ^ t) 




'' S '' s 

By sinO theorem, we also have ||^ ■ — ^ J < Op(||S — S||/Aj). So hnally 


|A./B(«)| - Op(p''(i + ^ + (i!^ + 1) (V + if))) ■ 

Since Cp^~^ < Cp^ < T, 

|roPp(t) - FDP^(t)| = Op(p\\\^^*\\p-'^ + . 


□ 
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C Convergence rate of error matrix 


In order to achieve convergence rate Theorem B.l for the covariance matrix of idiosyn 


cratic error, we employ the following lemma from Fan et al. (2013) 


Lemma C.l. Suppose that (logp)®“ = o(T) where a = 3r;^ ^ + 1 and Assumption 4-1 and 4-2 


hold. In addition, suppose that there is a sequence ar = o(l) so that maxj<pT ^ ~ 

= Op{a\) and ni8CKi<p^t<T\'Uit — Uit\ = op(l). Then there is a constant C > 0 in the 


adaptive thresholding estimator (4- 6) with Tij = CijjT{du,iiAu,jjY^'^ and 


ojt — 


logp 

~Y~ 


+ ClT , 


such that 




rUr. 


The essential step of applying the previous lemma is to hnd ap- We start by getting the 
convergence rate of F and B. Let V denote the mxm diagonal matrix of the hrst m largest 
eigenvalues of the sample covariance matrix in decreasing order. Recall that 


Dehne 


-Y'YF = FV. 

T 


H = -V-^F'FB'B. 

T 


Lemma C.2. The rates of convergence o/F are as follows: 

(i) ||F-FH'|| 

(ii) ||F-FH'| 


(t) ||F-FH'||p = Op(^ + ,/^), 


= Op((7t + * + £)(iogr)*), 


Proof, (i) By dehnition of F and H 


F - FH' = -(Y'Y - FB'BF')FV 


Since ||F||p = Op(a/T), ||V ^|| = Op(l/Am) from Theorem 

1 


3.1 


we have 


IF - FH'IIp < Op 


XmVr 


U'U + FB'U + U'BF'II , 
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where we used the fact ||AB||j? < || A|| HBUi?. By Lemma 6.1 


|lu'U|| = llluu'll < ||luu' - E.II + ||S„|| = Op(|), 


and since ||B||max = Op{^J\l/p) from Assumpt 


ion 


4.2 


T m 


T m p p 


E||B'U||| EEEE I ^u,iii2 |0(^) = 0(tA 


i=l j=l i=l 

Therefore by Markov inequalty, 


t=l j = l 2i = l 22 = 1 


P 


Hence, 


(ii) From (i) we conclude 


|FB'U|| < ||F||f||B'U|| =Op{Ty/^i). 


\F-FU'\\p = Op{-^ + ^l^]. 



|F-FH'IUax<Op 


y^mT 


XmVr V A, 


lU'UF + FB'UF + U'BF'FI 


Let us bound each term separately. For the hrst term, ||U'UF||max < ||U'U||oo||F||max and 

T T 

||U^U||oo = niax|u^Us| = max— E[ujUs] + E[ujUt] = Op{Ty/p + p). 


S=1 


S=1 


The second term is bounded as 11 FB'UF 11 max < nr||F||maxllB'U||fX3llF|lmax and 


B'Ulloo = max^^ \^Wt\ = 0{T^/X~i ), 

k<m ^ 


t=l 


since var(b'^Ui) = b'^S^bfc = 0(Ai). The third term can be bounded similarly. Together 
with the fact that ||B||inax = Op{^yXi/p) and i|F||jnax = Op{{\ogTy^'^^) from Assumption 


4.2, we obtain 


|F-FH'IUax<Op 


= Or 


XmT 


{p + T^) (log T) ’■2 + T A/A^(log T) ’-2 


1 P Vp 

m XmT Xm^ 


)(logT) 
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□ 


Lemma C.3. The rates of convergence for B are as follows. Two regimes are considered. 

If Am > Cip for constant Ci > 0, we have 
(t) ||H-i||=Op(l), 

(%%) ||B - BH-imax = Op(^bg^). 

//C 2 y^(logT )^/’'2 < Am < Cip for constant C 2 > 0, we have 
(T) IIH'H - Imll = Op(Cm + + I/VT), 

(iT) ||B - BH'lUa. = Op(ybg^). 


Proof, (i’) From Lemma C.2| (i) we have 


F'(F-FH')||p < Op(—)(-||F'U'UF|| + -||F'U'U|| ||F - FH|| +2||FB'U 

AmV ^T T 


We claim ||F'U'|| = Op{y/Tf)). Hence, ||F'(F - FH')||p = Op(p/Am) + Op{T/^\f). With 
this, we bound ||H'H — Im||- First obviously ||H|| = Op(l) since Ai/Am is bounded. Then 


from Fan et ah (2013), we know 


|H'H - Imll < -||F'(F - FH')||(1 + ||H||) + ||Hf ||F'F/T - I, 

= Op{Cm + 1/\/Am + 1/ VT) . 


It remains to show that ||F'U'|| = Op{\/Tp). By definition, 

||F'U'UF|| = IlUFF'U'll = sup ||F'U'x||2 < 2 sup ||F'U'xf , 

xe5P-i XSA^ 

where A/" is a 1/4-net of the unit sphere and {Afl < 9^. Since||F'U'x|p = 

YlT=i(^t<T using Chernoff bound, we have 

P(^||F'U'UF|| < 9^ • 

u^x is sub-Gaussian, so choosing t x Tp, we obtain that ||UF|| = Op{y/Tp). 

(i) In (i’), we showed ||H'H — Im|| = Op{cm + l/\/A/^ + If in addition, we know 

Am > Gip, then Cm = o(l) so that ||H'H — Im|| = op(l). So we conclude Amin(H'H) > 1/2 
with probability approaching one according to Weyl’s Theorem. Thus ||H Nl - Op{l). 
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(ii) Decompose B — BH ^ as follows: 


B - BH ^ = -YF - BH-^ 

T 

= ;^BH-^(HF' - F')F + ;^U(F - FH') + ;^UFH' 


T 


T 


T 


Fan et al. (2013) showed that 


T 


— ||UF||max= max 

1 i<p,k<m 


T 




t=l 


= Or 


logP \ 
T J ■ 


Thus the max norm of the last term is Op('\/logp/T). The max norms of the hrst and 
second terms are bounded respectively by 

^||B1|.„„||H-‘||||HF' - F'||„„ . VrllFllr = Op((^ + ^ + y^)(logr)^) . 

and by 

IIUU'UF + UFB'UF + UU'BF'F|U,. 

“ (a ) (ll II ^ m||UF||niax||B'U||oo||F||max + F || B'U || oo || U || max^ 

= (Fv^(logT)^ + v/2^(TV^)(logT)^ +T2^(log(pT))^) . 

Simplify and Combine the rates together, and note > Cip in this case and 
^(logT)^/^2 = o{Xm), we obtain, 


||B - BH-‘|U„ = Op(i((logr)4 + (log(pT))n) + = Op[^j^r) . 

(ii’) Now let us consider the other situation. We have a different decomposition of B — 

BH': 

B - BH' = -YF - BH' 

T 

= ^BF'(F - FH') + B(^F'F - I^)H' + ^U(F - FH') + ^UFH'. 

As before, the max norm of the last term is Op{^y\ogp/T). The max norms of the hrst three 
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terms are bounded respectively by 


m , 


|B| 


|F'(F - FH')||f = Op{^/^+l /^); 


V^\\B\U4 Iff - ij\ ||h'|| = Op{^x,/{pT)) ■ 


and 


O, 


A^T2 


lUU'UF + UFB'UF + UU'BFFI 


< Op 


A^T2 


T^(logT)4 + VTlogp(TyAl)(logT)^^ +T||B'UU'||, 


where ||B'UU'||max = Op{T\f\fp + \/Ai logp) is quite small. 
Simplify and Combine the rates together, we obtain, 


IB - Bniu„ ^ . A + /^) ^ . 


Proof of Theorem IB. II 

Proof. Recall that itu = yu — h'ift- We separately consider the two cases in Lemma 
Xm > Cip, so is well dehned. We have 


C.3 


□ 


If 


Uu - U^t = (b' - b'H-i)(h - Hh) + b'H-i(h - Hh) + (b' - b'H-i)Hh 


Therefore by Cauchy-Schwarz, 


maxT ^ |-Ujt — <3max ||b'H ^||^ —||F —FH' 

i<p ^ i P 


|2 

If 


t=i 


3m||B — BH 
3m||B — BH 


'‘"LxyllF-FH'fi. 


- 1||2 

max rjn 


E l|Hf< 


t=l 


It follows from Lemma C.2 and C.3 fii) that 


T 

maxT“^ 

i<p ^ ^ 

t=l 


\UU - ..r = Op(^(^ + i 


\pT\XlJ' Xj T ) T p 
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Replacing the average over t in the above inequality with maximum over t and T ^||F — 
FH'III with m IIF-FH' we can also derive bound for maxi^p^t^r \uit — Uit\. Since from 


Assumption 


4.2 


we 


have maxt<T ||ft|| = Op( (log ggt maxi<p,t<r \uit-Uit\ = op(l). 


Now if 02y^(logT)^02 < Am < Oip, we apply a different way of decomposing uu — Uu- 
Uu - Uit = (b' - b'H')(ft - Hb) + b'H'(h - Hh) + (b' - b'H')Hh + b'(H'H - Im)ft • 
Therefore by Cauchy-Schwarz, 


maxT ^ \uit — Uit\^ <4max ||b(H'||^ —||F — FH'| 

i<p i 1 

t=l 


4m||B-BH'£„pl|F-FH'l|J, 
4m||B - ||Hf,|p 


i=l 


4max||b,|n|H'H-ld V||b 

? J 


t=\ 


It follows from Lemma C.2 and C.3 (ii’) that 


T 

,2 n Ai\ f\ogp 1 

maxT y\uit-Uit\ = Op — = Op - 

i<P KpTW^T Am/ T pTJ \ T p 


Again, it is not hard to show maxj<p^t<T \uit — Uit\ = op(l). 
Finally, Lemma 


C.l 


concludes the theorem by choosing op 


cases. 


^J\ogp]T + \/l/p in both 

□ 
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